Customer Segmentation & Association
🎳 Problem Statement:
Develop a comprehensive data analysis pipeline to explore, preprocess, and derive insights from a given dataset. The analysis should include exploratory data analysis (EDA), dimensionality reduction using techniques like PCA, clustering to identify patterns, and association rule mining for discovering relationships between variables. The goal is to extract valuable insights from the data and provide actionable recommendations for decision-making.
🤔 Dataset Information
📌 Notebook Objectives
The objective of this study is to analyze the dataset to identify and understand the economic and demographic factors that influence respondents' perspectives and behaviors, by employing data analysis techniques and machine learning algorithms.
This case study aims to analyze:
- How do economic factors such as income and education level correlate with customers' perspectives and behaviors?
- Are there any significant relationships between marital status, family size, and customers' attitudes or behaviors?
- Can we identify segments of customers based on their demographic profiles and behaviors?
- What patterns can we derive from the responses regarding customers' preferences, concerns, or tendencies?
By addressing these questions, we aim to gain a deeper understanding of the factors driving customers' perspectives and behaviors, which can inform decision-making in areas such as marketing strategies, policy development, and social interventions.
# --- Importing Libraries ---
from IPython.display import display, HTML, Javascript
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
import warnings
warnings.filterwarnings("ignore")
import os
import joblib
#For Visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
# Creating an instance of the Color class
class Color:
    # ANSI color codes for console output
    start = '\033[91m'
    end = '\033[0m'
    text = '\033[94m'
clr = Color()
palette = ["#4361EE", "#7209B7", "#3A0CA3", "#4CC9F0", "#F72585"]
# Creating Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
# For Dimensionality Reduction Using PCA
from sklearn.decomposition import PCA
#For Clustering
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN
# Anomaly Detection
from sklearn.ensemble import IsolationForest
# Association Rule Mining
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import networkx as nx
# --- Importing Dataset ---
df = pd.read_excel("survey.xlsx")
questions = pd.read_csv("questions.csv")
# Reading Dataset
print(clr.start + '.: Survey Dataset :.' + clr.end)
print(clr.text + '*' * 23)
styled_df = df.head(10).reset_index(drop=True).style.background_gradient(cmap='Blues').set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', '')]}])
styled_df
.: Survey Dataset :. ***********************
| Designation | Age | Marital | Family | Education | Income | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9A | Q9B | Q9C | Q9D | Q9E | Q10 | Q11 | Q12A | Q12B | Q13A | Q13C | Q13D | Q14A | Q14B | Q14C | Q14D | Q14E | Q14F | Q15A | Q15B | Q15C | Q15D | Q15E | Q16A | Q16B | Q16C | Q16D | Q16E | Q16F | Q17A | Q17B | Q18B | Q18C | Q18E | Q19A | Q19C | Q19D | Q19E | Q19F | Q19G | Q19H | Q20 | Q21 | Q22 | Q23 | Q24 | Q25 | Q26 | Q27 | Q28 | Q29 | Q30 | Q31 | Q32 | Q33B | Q33C | Q33E | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Assistant_Prof | 2 | Married | Nuclear | Pursuing_PhD | 3 | Sometimes_Depends | No | Yes | Co | Hu | Decide | Yes | No | Himself_and_Family | Himself | Himself | Himself_and_Partner | None_of_Above | Yes | No | Ma | Nearby | None_of_Above | All | None_of_Above | Partner | Family | None_of_Above | Himself | Himself_and_Partner | None_of_Above | Himself_and_Family | Himself | Himself_and_Partner | Family | None_of_Above | Partner | Himself | Family | Partner | Family | None_of_Above | Fac | Nearby | Partner | Himself | Himself | Family | None_of_Above | None_of_Above | Himself_and_Partner | None_of_Above | Himself | Family_and_Partner | OC | No | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Branded | Branded | Branded |
| 1 | Assistant_Prof | 1 | Married | Joint | No_PhD | 1 | Frequently | No | Yes | Co | Hu | So | Yes | Yes | All | Family | Family | Himself | None_of_Above | No | No | Nearby | Nearby | Family | Himself | None_of_Above | Family | Family | Family_and_Partner | Himself | None_of_Above | None_of_Above | Family | None_of_Above | None_of_Above | None_of_Above | None_of_Above | Family | Family | None_of_Above | None_of_Above | All | None_of_Above | Branded | Nearby | Family | None_of_Above | None_of_Above | Family | Himself | Himself | Himself_and_Partner | None_of_Above | None_of_Above | Family | MS | Yes | No | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Company | None_of_Above | None_of_Above |
| 2 | Assistant_Prof | 2 | Married | Nuclear | No_PhD | 3 | Frequently | Yes | Yes | Myself | Hu | Decide | Yes | No | All | Himself | Himself | None_of_Above | None_of_Above | Yes | No | Branded | Nearby | None_of_Above | All | None_of_Above | None_of_Above | None_of_Above | Himself_and_Family | Partner | None_of_Above | None_of_Above | Himself | Family_and_Partner | Partner | None_of_Above | None_of_Above | All | None_of_Above | None_of_Above | None_of_Above | Himself_and_Family | None_of_Above | Nearby | Nearby | Himself_and_Family | None_of_Above | Partner | None_of_Above | Family | Himself_and_Family | None_of_Above | Partner | None_of_Above | None_of_Above | All | Yes | No | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Branded | None_of_Above | Branded |
| 3 | Assistant_Prof | 1 | Married | Nuclear | No_PhD | 1 | Frequently | No | Yes | Myself | Hu | Decide | Yes | No | Partner | Family | Himself | Himself_and_Partner | None_of_Above | No | No | Ma | Nearby | All | None_of_Above | None_of_Above | None_of_Above | Family | Himself | Partner | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | Family | None_of_Above | None_of_Above | None_of_Above | All | None_of_Above | None_of_Above | Ma | Nearby | All | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | NP | Yes | No | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | None_of_Above | Ma | None_of_Above |
| 4 | Assistant_Prof | 1 | Married | Nuclear | No_PhD | 1 | Always | Yes | Yes | Myself | Hu | So | Yes | Yes | All | Family_and_Partner | Family_and_Partner | None_of_Above | None_of_Above | Yes | No | Ma | Nearby | Family | All | None_of_Above | Family | None_of_Above | All | Family | None_of_Above | None_of_Above | None_of_Above | All | Family | Family | Family | Partner | None_of_Above | Family | Himself_and_Partner | None_of_Above | None_of_Above | All | Ma | Himself | None_of_Above | Family_and_Partner | Family | Himself_and_Partner | None_of_Above | Family | None_of_Above | None_of_Above | All | NP | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | All | None_of_Above | Nearby |
| 5 | Assistant_Prof | 1 | Married | Joint | No_PhD | 1 | Always | No | Yes | Co | Hu | Decide | Yes | Yes | Himself_and_Partner | Family_and_Partner | Family | Partner | None_of_Above | Yes | No | All | Ma | Family | All | None_of_Above | Family | Family | Himself | Partner | None_of_Above | Family | Family | None_of_Above | Himself | Partner | Family | Partner | Partner | None_of_Above | All | Himself_and_Family | None_of_Above | All | Ma | All | None_of_Above | All | Family_and_Partner | Family | Himself | None_of_Above | None_of_Above | None_of_Above | Partner | All | Yes | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | All | None_of_Above | Nearby |
| 6 | Assistant_Prof | 1 | Married | Nuclear | Pursuing_PhD | 1 | Always | Yes | Yes | Co | Hu | So | No | No | All | All | None_of_Above | None_of_Above | None_of_Above | Yes | Yes | All | Nearby | None_of_Above | All | None_of_Above | None_of_Above | Family | Himself_and_Partner | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | All | All | Fac | Ma | All | None_of_Above | All | None_of_Above | None_of_Above | Himself_and_Partner | Family | None_of_Above | None_of_Above | None_of_Above | All | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Branded | Branded | None_of_Above |
| 7 | Assistant_Prof | 1 | Married | Joint | No_PhD | 1 | Frequently | No | Yes | Myself | Hu | Decide | Yes | Yes | All | None_of_Above | Partner | None_of_Above | None_of_Above | Yes | No | Ma | Nearby | None_of_Above | All | None_of_Above | None_of_Above | Family | Himself_and_Partner | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | Himself | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | None_of_Above | Ma | Nearby | All | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | None_of_Above | None_of_Above | None_of_Above | None_of_Above | MS | No | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Ma | None_of_Above | None_of_Above |
| 8 | Assistant_Prof | 1 | Married | Nuclear | No_PhD | 1 | Frequently | Yes | Yes | Co | Hu | Hu | Yes | No | Himself | Himself | Family | Himself_and_Partner | Himself | Yes | No | Ma | Ma | None_of_Above | All | None_of_Above | None_of_Above | Family | Himself_and_Partner | None_of_Above | None_of_Above | Partner | Family | Himself | Partner | Family_and_Partner | Family | None_of_Above | Himself | None_of_Above | All | Family | Partner | Ma | Ma | Himself | None_of_Above | Family | None_of_Above | Family | Himself_and_Partner | None_of_Above | None_of_Above | Partner | Family | NP | Yes | No | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Company | None_of_Above | None_of_Above |
| 9 | Assistant_Prof | 1 | Married | Nuclear | Pursuing_PhD | 1 | Always | No | Yes | Myself | Hu | Hu | Yes | No | All | Himself | Family | None_of_Above | None_of_Above | Yes | No | Branded | Nearby | None_of_Above | All | None_of_Above | None_of_Above | None_of_Above | All | None_of_Above | None_of_Above | None_of_Above | All | None_of_Above | Himself | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | None_of_Above | All | None_of_Above | Fac | Nearby | All | None_of_Above | All | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | None_of_Above | All | Fam | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Branded | None_of_Above | None_of_Above |
# Questions Dataset
print(clr.start + '.: Questions Dataset :.' + clr.end)
print(clr.text + '*' * 23)
styled_df = questions.head(10).reset_index(drop=True).style.background_gradient(cmap='Blues').set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', '')]}])
styled_df
.: Questions Dataset :. ***********************
| Designation | Age | Marital | Family | Education | Income | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9A | Q9B | Q9C | Q9D | Q9E | Q10 | Q11 | Q12A | Q12B | Q13A | Q13C | Q13D | Q14A | Q14B | Q14C | Q14D | Q14E | Q14F | Q15A | Q15B | Q15C | Q15D | Q15E | Q16A | Q16B | Q16C | Q16D | Q16E | Q16F | Q17A | Q17B | Q18B | Q18C | Q18E | Q19A | Q19C | Q19D | Q19E | Q19F | Q19G | Q19H | Q20 | Q21 | Q22 | Q23 | Q24 | Q25 | Q26 | Q27 | Q28 | Q29 | Q30 | Q31 | Q32 | Q33B | Q33C | Q33E | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Designation | Age in Years | Marital Status | Family Type | Education | Approximately Monthly Income | I am involved in the purchase of these product category | I only take decisions for any purchase in the family | I am involved in the decision making for the purchase in the family | When I purchase for myself I take the decision by | When I purchase for myself I consult | I take a decision when a purchase is to be made for | I bargain | I purchase products when I get the bargain | I take the decision myself to buy these products | I consult Family/others to buy these items | I take the decision based on choices of family members for the following product category | I buy these branded products online | I buy these non-branded products online | I go for shopping for one particular item and end up purchasing other items also | I shop till I drop | I buy branded products from these shops | I buy non-branded products from these shops | I buy these products in the Morning | I buy these products in the Evening | I buy these products in the Night | I buy these products Daily | I buy these products Weekly | I buy these products Monthly | I buy these products Quarterly | I buy these products Half Yearly | I buy these products Yearly | I buy these products on Weekly Holiday | I buy these products on Holiday | I buy these products on Festivals | I buy these products on Family functions | I buy these products on Birthdays | I buy these products from Branded Retailers | I buy these products from Company Showroom | I buy these products from Factory Outlet | I buy these products from Malls | I buy these products from Nearby Retailer | I buy these products from Roadside shop | I buy branded products from these shops | I buy non-branded products from these shops | I buy these products by using Debit Card | I buy these products by using Credit Card | I buy these products in Cash | I buy these products only one quantity | I buy these products in Weekly Quantity | I buy these products in Monthly Quantity | I buy these products in Quarterly Quantity | I buy these products in Half Yearly Quantity | I buy these products in Yearly Quantity | I buy these products as per need | I refer this for offers | Normally I visit only one shop which I know | I visit number of shops till I get what I want | I like to buy from shops which have lots of variety | I like to buy from shops where the sales people are cordial | I don't like to buy from shops where sales people promote specific products or show products of their choice | I take the opinion of the sales people of shops | I would like to buy from shops which are open upto 10-11 pm | I like to buy from shops which sells at Fixed rates | I like to buy from shops which offer discounts | I like to buy from shops which also offer services to customize the product to suit my requirement | I like to buy from shops which allow trial or give demo | I like to buy from shops where sales people give enough information about product | I prefer to buy from shops which accept Debit Card | I prefer to buy from shops which accept Credit Card | I prefer to buy from shops which accept Cheque |
| 1 | Respondent_Information | Respondent_Information | Respondent_Information | Respondent_Information | Respondent_Information | Respondent_Information | Demographic_Information | Demographic_Information | Demographic_Information | Decision_Making_Behavior | Decision_Making_Behavior | Decision_Making_Behavior | Bargaining_Behavior | Bargaining_Behavior | Decision_Making | Decision_Making | Decision_Making | Decision_Making | Decision_Making | Shopping_Habits | Shopping_Habits | Shopping_Habits | Shopping_Habits | Purchase_Timing | Purchase_Timing | Purchase_Timing | Purchase_Timing | Purchase_Timing | Purchase_Timing | Purchase_Timing | Purchase_Timing | Purchase_Timing | Special_Occasions | Special_Occasions | Special_Occasions | Special_Occasions | Special_Occasions | Purchase_Location | Purchase_Location | Purchase_Location | Purchase_Location | Purchase_Location | Purchase_Location | Payment_Preferences | Payment_Preferences | Payment_Preferences | Payment_Preferences | Quantity_and_Frequency | Quantity_and_Frequency | Quantity_and_Frequency | Quantity_and_Frequency | Quantity_and_Frequency | Quantity_and_Frequency | Quantity_and_Frequency | Preference_for_Offers | Preference_for_Offers | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Shopping_Behavior | Payment_Method | Payment_Method | Payment_Method |
print(clr.text + '.: DataSet Description :.' + clr.end)
print("-"*30)
info = pd.DataFrame(df.isnull().sum(), columns=["IsNull"])
# Note: isna() is an alias of isnull(), so the IsNull and IsNa rows are identical
info.insert(1, "IsNa", df.isna().sum(), True)
info.insert(2, "Duplicate", df.duplicated().sum(), True)
info.insert(3, "Unique", df.nunique(), True)
info.insert(4, "Min", df.min(), True)
info.insert(5, "Max", df.max(), True)
info = info.T
info
.: DataSet Description :.
------------------------------
| Designation | Age | Marital | Family | Education | Income | Q1 | Q2 | Q3 | Q4 | ... | Q26 | Q27 | Q28 | Q29 | Q30 | Q31 | Q32 | Q33B | Q33C | Q33E | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IsNull | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| IsNa | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Duplicate | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 | ... | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 |
| Unique | 4 | 3 | 2 | 2 | 4 | 5 | 5 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 7 | 7 | 8 |
| Min | Assistant_Prof | 1 | Married | Joint | No_PhD | 1 | Always | No | No | Co | ... | No | No | No | No | N | No | No | All | All | All |
| Max | Professor | 3 | Unmarried | Nuclear | Pursuing_PhD | 5 | Sometimes_Depends | Yes | Yes | Myself | ... | Yes | Yes | Yes | Yes | Yes | Yes | Yes | None_of_Above | None_of_Above | RS |
6 rows × 71 columns
print(clr.text + '.: DataSet Information :.' + clr.end)
print("-"*30)
df.info()
.: DataSet Information :.
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 71 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Designation 287 non-null object
1 Age 287 non-null int64
2 Marital 287 non-null object
3 Family 287 non-null object
4 Education 287 non-null object
5 Income 287 non-null int64
6 Q1 287 non-null object
7 Q2 287 non-null object
8 Q3 287 non-null object
9 Q4 287 non-null object
10 Q5 287 non-null object
11 Q6 287 non-null object
12 Q7 287 non-null object
13 Q8 287 non-null object
14 Q9A 287 non-null object
15 Q9B 287 non-null object
16 Q9C 287 non-null object
17 Q9D 287 non-null object
18 Q9E 287 non-null object
19 Q10 287 non-null object
20 Q11 287 non-null object
21 Q12A 287 non-null object
22 Q12B 287 non-null object
23 Q13A 287 non-null object
24 Q13C 287 non-null object
25 Q13D 287 non-null object
26 Q14A 287 non-null object
27 Q14B 287 non-null object
28 Q14C 287 non-null object
29 Q14D 287 non-null object
30 Q14E 287 non-null object
31 Q14F 287 non-null object
32 Q15A 287 non-null object
33 Q15B 287 non-null object
34 Q15C 287 non-null object
35 Q15D 287 non-null object
36 Q15E 287 non-null object
37 Q16A 287 non-null object
38 Q16B 287 non-null object
39 Q16C 287 non-null object
40 Q16D 287 non-null object
41 Q16E 287 non-null object
42 Q16F 287 non-null object
43 Q17A 287 non-null object
44 Q17B 287 non-null object
45 Q18B 287 non-null object
46 Q18C 287 non-null object
47 Q18E 287 non-null object
48 Q19A 287 non-null object
49 Q19C 287 non-null object
50 Q19D 287 non-null object
51 Q19E 287 non-null object
52 Q19F 287 non-null object
53 Q19G 287 non-null object
54 Q19H 287 non-null object
55 Q20 287 non-null object
56 Q21 287 non-null object
57 Q22 287 non-null object
58 Q23 287 non-null object
59 Q24 287 non-null object
60 Q25 287 non-null object
61 Q26 287 non-null object
62 Q27 287 non-null object
63 Q28 287 non-null object
64 Q29 287 non-null object
65 Q30 287 non-null object
66 Q31 287 non-null object
67 Q32 287 non-null object
68 Q33B 287 non-null object
69 Q33C 287 non-null object
70 Q33E 287 non-null object
dtypes: int64(2), object(69)
memory usage: 159.3+ KB
print(clr.text + '.: Missing Values by Column :.' + clr.end)
print("-"*30)
print(df.isna().sum())
print("-"*30)
print("TOTAL MISSING VALUES:",df.isna().sum().sum())
.: Missing Values by Column :.
------------------------------
Designation 0
Age 0
Marital 0
Family 0
Education 0
..
Q31 0
Q32 0
Q33B 0
Q33C 0
Q33E 0
Length: 71, dtype: int64
------------------------------
TOTAL MISSING VALUES: 0
Categorizing the Questions
| Category | Question No: Question |
|---|---|
| Demographic Information | Q1: I am involved in the purchase of these product category |
| | Q2: I only take decisions for any purchase in the family |
| | Q3: I am involved in the decision making for the purchase in the family |
| Decision-Making Behavior | Q4: When I purchase for myself, I take the decision by |
| | Q5: When I purchase for myself, I consult |
| | Q6: I take a decision when a purchase is to be made for |
| Bargaining Behavior | Q7: I bargain |
| | Q8: I purchase products when I get the bargain |
| Decision-Making Influence | Q9A-Q9H: I take the decision myself to buy these products / I consult family/others to buy these items / I take the decision based on choices of family members for the following product category / I buy these branded products online / I buy these non-branded products online / I buy these branded products offline / I buy these non-branded products offline |
| Shopping Habits | Q10: I go for shopping for one particular item and end up purchasing other items also |
| | Q11: I shop till I drop |
| | Q12A-Q12B: I buy branded products from these shops / I buy non-branded products from these shops |
| Purchase Timing | Q13A-Q13D: I buy these products in the Morning / I buy these products in the Afternoon / I buy these products in the Evening / I buy these products in the Night |
| | Q14A-Q14F: I buy these products Daily / I buy these products Weekly / I buy these products Monthly / I buy these products Quarterly / I buy these products Half Yearly / I buy these products Yearly |
| Special Occasions | Q15A-Q15E: I buy these products on Weekly Holiday / I buy these products on Holiday / I buy these products on Festivals / I buy these products on Family functions / I buy these products on Birthdays |
| Purchase Location | Q16A-Q16F: I buy these products from Branded Retailers / I buy these products from Company Showroom / I buy these products from Factory Outlet / I buy these products from Malls / I buy these products from Nearby Retailer / I buy these products from Roadside shop |
| Payment Preferences | Q17A-Q17B: I buy branded products from these shops / I buy non-branded products from these shops |
| | Q18A-Q18F: I buy these products Online / I buy these products by using Debit Card / I buy these products by using Credit Card / I buy these products by using Cheque / I buy these products in Cash / I buy these products on EMI |
| Quantity and Frequency | Q19A-Q19H: I buy these products only one quantity / I buy these products in Two quantity / I buy these products in Weekly Quantity / I buy these products in Monthly Quantity / I buy these products in Quarterly Quantity / I buy these products in Half Yearly Quantity / I buy these products in Yearly Quantity / I buy these products as per need |
| Preference for Offers | Q20: I refer this for offers |
| Shopping Behavior | Q21-Q32: Normally I visit only one shop which I know / I visit number of shops till I get what I want / I like to buy from shops which have lots of variety / I like to buy from shops where the sales people are cordial / I don't like to buy from shops where sales people promote specific products or show products of their choice / I take the opinion of the sales people of shops / I would like to buy from shops which are open upto 10-11 pm / I like to buy from shops which sells at Fixed rates / I like to buy from shops which offer discounts / I like to buy from shops which also offer services to customize the product to suit my requirement / I like to buy from shops which allow trial or give demo / I like to buy from shops where sales people give enough information about product |
| Payment Methods | Q33A-Q33E: I prefer to buy from shops which accept Online payments / I prefer to buy from shops which accept Debit Card / I prefer to buy from shops which accept Credit Card / I prefer to buy from shops which accept Cheque / I prefer to buy from shops which accept Cash |
Demographic_Information_Questions = ['Q1', 'Q2', 'Q3']
Decision_Making_Behavior_Questions = ['Q4', 'Q5', 'Q6']
Bargaining_Behavior_Questions = ['Q7', 'Q8']
Decision_Making_Influence_Questions = ['Q9A', 'Q9B', 'Q9C', 'Q9D', 'Q9E', 'Q9F', 'Q9G', 'Q9H']
Shopping_Habits_Questions = ['Q10', 'Q11', 'Q12A', 'Q12B']
Purchase_Timing_Questions = ['Q13A', 'Q13B', 'Q13C', 'Q13D', 'Q14A', 'Q14B', 'Q14C', 'Q14D', 'Q14E', 'Q14F']
Special_Occasions_Questions = ['Q15A', 'Q15B', 'Q15C', 'Q15D', 'Q15E']
Purchase_Location_Questions = ['Q16A', 'Q16B', 'Q16C', 'Q16D', 'Q16E', 'Q16F']
Payment_Preferences_Questions = ['Q17A', 'Q17B', 'Q18A', 'Q18B', 'Q18C', 'Q18D', 'Q18E', 'Q18F']
Quantity_and_Frequency_Questions = ['Q19A', 'Q19B', 'Q19C', 'Q19D', 'Q19E', 'Q19F', 'Q19G', 'Q19H']
Preference_for_Offers_Questions = ['Q20']
Shopping_Behavior_Questions = ['Q21', 'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31', 'Q32']
Payment_Method_Questions = ['Q33A', 'Q33B', 'Q33C', 'Q33D', 'Q33E']
question_categories = {
"Demographic Information": Demographic_Information_Questions,
"Decision Making Behavior": Decision_Making_Behavior_Questions,
"Bargaining Behavior": Bargaining_Behavior_Questions,
"Decision Making Influence": Decision_Making_Influence_Questions,
"Shopping Habits": Shopping_Habits_Questions,
"Purchase Timing": Purchase_Timing_Questions,
"Special Occasions": Special_Occasions_Questions,
"Purchase Location": Purchase_Location_Questions,
"Payment Preferences": Payment_Preferences_Questions,
"Quantity and Frequency": Quantity_and_Frequency_Questions,
"Preference for Offers": Preference_for_Offers_Questions,
"Shopping Behavior": Shopping_Behavior_Questions,
"Payment Method": Payment_Method_Questions
}
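Note that some codes in the lists above (e.g. `Q9F`-`Q9H`, `Q13B`, `Q18A`, `Q33A`) belong to the full questionnaire but do not appear among the dataset's 71 columns, so per-category counts reflect the questionnaire rather than the columns actually present. A quick way to spot such gaps is a set difference against the frame's columns; a minimal sketch on a toy frame (column layout hypothetical):

```python
import pandas as pd

question_categories = {
    "Bargaining Behavior": ["Q7", "Q8"],
    "Decision Making Influence": ["Q9A", "Q9B", "Q9F"],  # Q9F deliberately absent below
}
# Toy frame mimicking a dataset that lacks some questionnaire columns
toy = pd.DataFrame(columns=["Q7", "Q8", "Q9A", "Q9B"])

missing = {
    cat: sorted(set(qs) - set(toy.columns))
    for cat, qs in question_categories.items()
}
print(missing)  # -> {'Bargaining Behavior': [], 'Decision Making Influence': ['Q9F']}
```

Running the same check against `df.columns` would show exactly which questionnaire items were dropped from the survey export.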
# Count the number of questions in each category
# (loop variable renamed so it does not shadow the `questions` DataFrame)
category_counts = {category: len(qs) for category, qs in question_categories.items()}
# Plot the pie chart
fig = go.Figure(data=[go.Pie(labels=list(category_counts.keys()), values=list(category_counts.values()))])
fig.update_layout(title="Distribution of Questions by Category", width=800, height=600)
fig.show()
# Categorical Columns
cat_columns = ['Designation', 'Marital', 'Family', 'Education']
print(clr.text + '.: Categorical Columns :.' + clr.end)
print(f" {cat_columns}")
print("-"*50)
# Numerical Columns
num_columns = ['Age', 'Income']
print(clr.text + '.: Numerical Columns :.' + clr.end)
print(f" {num_columns}")
print("-"*50)
# Binary Columns
binary_columns = []
for column in df.columns:
    if df[column].nunique() == 2 and set(df[column].unique()) == {"Yes", "No"}:
        binary_columns.append(column)
print(clr.text + '.: Binary Columns :.' + clr.end)
print(f" {binary_columns}")
print("-"*50)
categorical_and_binary_cols = ['Designation', 'Marital', 'Family', 'Education','Q1',
'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9A', 'Q9B', 'Q9C', 'Q9D',
'Q9E', 'Q10', 'Q11', 'Q12A', 'Q12B', 'Q13A', 'Q13C', 'Q13D', 'Q14A',
'Q14B', 'Q14C', 'Q14D', 'Q14E', 'Q14F', 'Q15A', 'Q15B', 'Q15C', 'Q15D',
'Q15E', 'Q16A', 'Q16B', 'Q16C', 'Q16D', 'Q16E', 'Q16F', 'Q17A', 'Q17B',
'Q18B', 'Q18C', 'Q18E', 'Q19A', 'Q19C', 'Q19D', 'Q19E', 'Q19F', 'Q19G',
'Q19H', 'Q20', 'Q21', 'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28',
'Q29', 'Q30', 'Q31', 'Q32', 'Q33B', 'Q33C', 'Q33E']
print(clr.text + '.: Categorical and Binary Columns :.' + clr.end)
print(f" {categorical_and_binary_cols}")
print("-"*50)
.: Categorical Columns :. ['Designation', 'Marital', 'Family', 'Education'] -------------------------------------------------- .: Numerical Columns :. ['Age', 'Income'] -------------------------------------------------- .: Binary Columns :. ['Q2', 'Q3', 'Q7', 'Q8', 'Q10', 'Q11', 'Q21', 'Q22', 'Q23', 'Q24', 'Q26', 'Q27', 'Q28', 'Q29', 'Q31', 'Q32'] -------------------------------------------------- .: Categorical and Binary Columns :. ['Designation', 'Marital', 'Family', 'Education', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9A', 'Q9B', 'Q9C', 'Q9D', 'Q9E', 'Q10', 'Q11', 'Q12A', 'Q12B', 'Q13A', 'Q13C', 'Q13D', 'Q14A', 'Q14B', 'Q14C', 'Q14D', 'Q14E', 'Q14F', 'Q15A', 'Q15B', 'Q15C', 'Q15D', 'Q15E', 'Q16A', 'Q16B', 'Q16C', 'Q16D', 'Q16E', 'Q16F', 'Q17A', 'Q17B', 'Q18B', 'Q18C', 'Q18E', 'Q19A', 'Q19C', 'Q19D', 'Q19E', 'Q19F', 'Q19G', 'Q19H', 'Q20', 'Q21', 'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31', 'Q32', 'Q33B', 'Q33C', 'Q33E'] --------------------------------------------------
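The Yes/No detection above can be exercised on a toy frame to see why some two-level columns are excluded; a minimal sketch (column names and values hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Q2": ["Yes", "No", "Yes"],    # two levels, exactly {Yes, No} -> flagged
    "Q4": ["Co", "Myself", "Co"],  # two levels, but not Yes/No -> skipped
    "Q7": ["Yes", "Yes", "Yes"],   # only one level -> skipped
})

# Same rule as in the notebook: exactly two unique values, and they are Yes/No
binary_columns = [
    col for col in toy.columns
    if toy[col].nunique() == 2 and set(toy[col].unique()) == {"Yes", "No"}
]
print(binary_columns)  # -> ['Q2']
```

This explains why, in the output above, multi-level questions such as Q1 land in the categorical group even though they are survey questions like the binary ones.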
Creating Dataset Report
ProfileReport(df[['Designation', 'Age', 'Marital', 'Family', 'Education', 'Income']],
              title="Customer Segmentation",
              minimal=True,
              progress_bar=False,
              samples=None,
              interactions=None,
              explorative=True,
              dark_mode=True,
              notebook={'iframe': {'height': '600px'},
                        # primary_color expects a color string, not the Color instance
                        'html': {'style': {'primary_color': palette[0]}},
                        'missing_diagrams': {'heatmap': False, 'dendrogram': False}}
              ).to_notebook_iframe()
fig_age_bar = go.Figure(go.Bar(
x=df['Age'].value_counts().index,
y=df['Age'].value_counts().values,
marker_color="#3A0CA3"
))
fig_age_bar.update_layout(title_text="Distribution of Age", xaxis=dict(title='Age'), yaxis=dict(title='Count'))
# Create bar chart for Income
fig_income_bar = go.Figure(go.Bar(
x=df['Income'].value_counts().index,
y=df['Income'].value_counts().values,
marker_color="#3A0CA3"
))
fig_income_bar.update_layout(title_text="Distribution of Income", xaxis=dict(title='Income'), yaxis=dict(title='Count'))
# Show the plots
fig_age_bar.show()
fig_income_bar.show()
sns.set_palette(palette)
# Iterate through categorical columns
for i, col in enumerate(cat_columns):
# Create subplots
fig = go.Figure()
# Plot count plot
fig.add_trace(go.Bar(
x=df[col].value_counts().index,
y=df[col].value_counts().values,
name='Count'
))
# Update layout for count plot
fig.update_layout(
title=f"Distribution of {col}",
xaxis=dict(title=col),
yaxis=dict(title='Count')
)
# Show the plot
fig.show()
# Create another figure for pie chart
fig_pie = go.Figure()
# Plot pie chart
fig_pie.add_trace(go.Pie(
labels=df[col].value_counts().index,
values=df[col].value_counts().values,
hole=0.7,
name='Percentage',
hoverinfo='label+percent',
textinfo='percent',
textfont_size=15,
))
# Update layout for pie chart
fig_pie.update_layout(
title=f"Distribution of {col} (Percentage)",
)
# Show the pie chart
fig_pie.show()
for column in categorical_and_binary_cols:
# Calculate value counts
counts = df[column].value_counts()
# Create pie chart
fig = go.Figure(data=[go.Pie(labels=counts.index, values=counts.values, hole=.3)])
fig.update_layout(title_text=f'{column}: {questions[column][0]}', showlegend=True)
fig.show()
💡 Plot Analysis
From the plots above, we observe that:
- The distribution of the Education column shows that a majority of respondents have pursued or completed a PhD, while a smaller portion are currently pursuing a PhD or have no PhD.
- In the Marital column, most respondents are married, with a smaller proportion being single.
- The Family column indicates that the majority of respondents come from nuclear families, with fewer respondents from joint families.
- Regarding the Income column, there is a varied distribution, with a significant number of respondents having a low income and a smaller portion having a high income.
- The distribution of Age shows that the majority of respondents fall into the younger age group, with fewer respondents in the older age group.
6.1 | Processing Pipeline 🪠
The preprocessing below is applied before the data is split into x_train and x_test. Not all columns go through the same transformation. Numerical columns are scaled with a Min-Max scaler, since this is a small dataset in which outliers can dramatically affect a model's performance. Categorical columns with more than two categories are one-hot encoded, while categorical columns with a hierarchical order are ordinally encoded.

| Column | Description | Encoding or Scaling |
|---|---|---|
| Designation | This column represents the positions or designations of the respondents. Since it has a hierarchical order, using ordinal encoding would be more appropriate. Ordinal encoding assigns numerical values based on the position of each category in the hierarchy, preserving the hierarchical relationship between categories. | Ordinal Encoding |
| Age | This column represents the age of the respondents in years. Since it's a numerical variable, no encoding is required. However, you may need to scale it using Min-Max Scaling if the range of ages is large. | Min-Max Scaling required |
| Marital | This column represents the marital status of the respondents. Since it's a nominal categorical variable with no hierarchical order, One-Hot Encoding can be used. | One-Hot Encoding |
| Family | This column indicates the type of family the respondents belong to. Since it's a nominal categorical variable with no hierarchical order, One-Hot Encoding can be used. | One-Hot Encoding |
| Education | This column represents the education level of the respondents. It has a hierarchical order, such as Non-PhD, Post-PhD, and PhD. Therefore, using ordinal encoding would be more suitable. Ordinal encoding assigns numerical values based on the position of each category in the hierarchy, preserving the hierarchical relationship between categories. | Ordinal Encoding |
| Income | This column represents the approximate monthly income of the respondents. Since it's a numerical variable, no encoding is required. However, you may need to scale it using Min-Max Scaling if the range of incomes is large. | Min-Max Scaling required |
| Q1 to Q33F | nominal categorical variables with no hierarchical order. You can use One-Hot Encoding to encode the responses into binary vectors | One-Hot Encoding |
# --- Creating copy of Dataset ---
X = df.copy()
Ordinal Encoding
designation_mapping = {
'Assistant_Prof': 0,
'Associate_Prof': 1,
'Associate_Professor': 1,  # same rank as 'Associate_Prof'; the raw data uses both labels
'Professor': 2
}
X['Designation'].replace(designation_mapping, inplace=True)
education_mapping = {
'No_PhD': 0,
'Pursuing_PhD': 1,
'PhD': 2,
'Post_PhD': 3
}
X['Education'].replace(education_mapping, inplace=True)
Min-Max Scaling
# Scaling Age (applied to X, the working copy, not the original df)
X['Age'] = (X['Age'] - X['Age'].min()) / (X['Age'].max() - X['Age'].min())
# Scaling Income
X['Income'] = (X['Income'] - X['Income'].min()) / (X['Income'].max() - X['Income'].min())
Binary Encoding
for column in binary_columns:
X.replace({column: {'No': 0, 'Yes': 1}}, inplace=True)
# Saved for later use
X_Clustred = X.copy()
One-Hot Encoding
one_hot_columns = [
'Education', 'Marital', 'Family', 'Q1', 'Q4', 'Q5', 'Q6',
'Q9A', 'Q9B', 'Q9C', 'Q9D', 'Q9E', 'Q12A', 'Q12B', 'Q13A', 'Q13C', 'Q13D',
'Q14A', 'Q14B', 'Q14C', 'Q14D', 'Q14E', 'Q14F', 'Q15A', 'Q15B', 'Q15C',
'Q15D', 'Q15E', 'Q16A', 'Q16B', 'Q16C', 'Q16D', 'Q16E', 'Q16F', 'Q17A',
'Q17B', 'Q18B', 'Q18C', 'Q18E', 'Q19A', 'Q19C', 'Q19D', 'Q19E', 'Q19F',
'Q19G', 'Q19H', 'Q20', 'Q25', 'Q30', 'Q33B', 'Q33C', 'Q33E'
]
encoder = OneHotEncoder(drop='if_binary', sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
X_encoded = encoder.fit_transform(X[one_hot_columns])
encoded_columns = encoder.get_feature_names_out(one_hot_columns)
X_processed_df = pd.DataFrame(X_encoded, columns=encoded_columns, index=X.index)
X_processed_df = pd.concat([X_processed_df, X.drop(columns=one_hot_columns)], axis=1)
X_processed_df.shape
(287, 333)
6.2 | Anomaly Detection using Isolation Forest
Isolation Forest is an ensemble-based anomaly detection algorithm that is particularly effective for identifying outliers or anomalies in high-dimensional datasets. It works by isolating anomalous data points in a tree structure and measuring their isolation depth.
The algorithm randomly selects a feature and a split value to partition the data at each node of the tree. Anomalies are expected to require fewer splits to be isolated, resulting in shorter path lengths in the tree. By averaging the path lengths across multiple trees, Isolation Forest can effectively identify anomalies.
Isolation Forest is well-suited for detecting anomalies in large-scale datasets with mixed attribute types. It is robust to the presence of irrelevant features and can efficiently handle high-dimensional data. Isolation Forest has applications in cybersecurity, fraud detection, network intrusion detection, and outlier detection in sensor data.
numerical_features = X_processed_df.select_dtypes(include=['float64', 'int64'])
model = IsolationForest(contamination=0.05)
model.fit(numerical_features)
X_processed_df['anomaly_score'] = model.decision_function(numerical_features)
X_processed_df['anomaly_label'] = model.predict(numerical_features)
anomalies = X_processed_df[X_processed_df['anomaly_label'] == -1]
#Dropping anomalies from DataFrame
X_processed_df = X_processed_df[X_processed_df['anomaly_label'] != -1]
# Drop the anomaly_score and anomaly_label columns
X_processed_df.drop(columns=['anomaly_score', 'anomaly_label'], inplace=True)
fig = px.scatter(anomalies, x='Age', y='Income', color='anomaly_label',
title='Anomalies Detected by Isolation Forest (2D)')
fig.show()
# DataFrame after removing anomalies
X_processed_df.shape
(272, 333)
6.3 | Feature Separation and Splitting 🪓
# --- Splitting Dataset ---
X_train, X_test = train_test_split(X_processed_df, test_size=0.1, random_state=42)
print(f"{X_train.shape = }")
print(f"{X_test.shape = }")
X_train.shape = (244, 333) X_test.shape = (28, 333)
7. | Principal Component Analysis 🪄
Principal Component Analysis, or PCA, is a statistical technique used for dimensionality reduction and data compression. It aims to transform high-dimensional data into a lower-dimensional space while retaining most of the important information.
This dataset contains many features. PCA can project the data from this high-dimensional space into a lower-dimensional one, which helps to segment the customers and makes the data easier to visualize.
Explained variance:
Explained variance refers to the proportion of total variance in the dataset that a specific principal component accounts for in PCA. It helps assess the significance of each component in capturing the variability of the data.
Cumulative variance:
Cumulative variance represents the total amount of variance explained by a subset of components in a dataset. In PCA, it's the sum of variances explained by each component up to a certain point.
pca = PCA(n_components=100)
pca.fit(X_train)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
optimal_components = np.where(np.diff(cumulative_variance) < 0.01)[0][0] + 1
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, 101)), y=list(cumulative_variance), mode='lines+markers', name='Cumulative Explained Variance'))
fig.add_shape(
type='line',
x0=1,
y0=cumulative_variance[optimal_components - 1],
x1=100,
y1=cumulative_variance[optimal_components - 1],
line=dict(color='black', width=2, dash='dash')
)
fig.add_annotation(
x=25,
y=cumulative_variance[optimal_components - 1] + 0.02,
text=f'Optimal Components = {optimal_components}',
showarrow=True,
arrowhead=1,
ax=0,
ay=-40
)
fig.update_layout(
title='Cumulative Explained Variance by Number of Components',
xaxis_title='Number of Components',
yaxis_title='Cumulative Explained Variance',
xaxis=dict(dtick=10),
yaxis=dict(tickformat='.2%'),
showlegend=True
)
fig.show()
# Initiating PCA to reduce the feature space to 3 dimensions
pca = PCA(n_components=3)
pca.fit(X_train)
X_reduced = pd.DataFrame(pca.transform(X_train),columns=(["x","y", "z"]))
print("Reduced X After PCA")
print("-"*30)
X_reduced.describe().T
Reduced X After PCA ------------------------------
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| x | 244.0 | -7.644159e-17 | 1.656185 | -2.466837 | -1.386465 | -0.201163 | 1.365579 | 3.488232 |
| y | 244.0 | 1.456030e-17 | 1.489839 | -2.796929 | -1.085418 | 0.138452 | 0.889171 | 4.142758 |
| z | 244.0 | 5.824121e-17 | 1.390931 | -3.200785 | -0.709497 | -0.095810 | 0.716121 | 4.858280 |
8.1 | K-Means Clustering
K-Means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a specified number of clusters. It aims to group similar data points together while keeping dissimilar points in different clusters. K-Means clustering is widely used for exploratory data analysis, data visualization, and pattern recognition.
The algorithm works by iteratively assigning each data point to the nearest centroid and then recalculating the centroids based on the mean of the data points in each cluster. This process continues until the centroids stabilize or the maximum number of iterations is reached.
Using K-Means clustering allows for the identification of natural groupings or clusters within the dataset, enabling better understanding and interpretation of the underlying structure of the data. It is particularly useful for segmenting data into homogeneous subgroups, which can then be analyzed separately or used as features in subsequent machine learning models.
Finding the optimal number of clusters based on the maximum silhouette score
silhouette_scores = []
k_values = range(2, 5)
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X_train)
labels = kmeans.labels_
silhouette_avg = silhouette_score(X_train, labels)
silhouette_scores.append(silhouette_avg)
optimal_k = k_values[np.argmax(silhouette_scores)]
max_silhouette_score = max(silhouette_scores)
print("Optimal number of clusters:", optimal_k)
print("Maximum silhouette score:", max_silhouette_score)
Optimal number of clusters: 4 Maximum silhouette score: 0.11334930313916518
# Initiating the KMeans model
kmeans = KMeans(n_clusters=optimal_k, random_state=42)  # optimal_k = 4 from the silhouette search above
# Fit model and predict clusters
Kmeans_X = X_reduced.copy()
yhat_kmeans = kmeans.fit_predict(X_reduced)
# Assigning cluster labels to the reduced data
Kmeans_X["Clusters"] = yhat_kmeans
palette = ["#7F58AF","#64C5EB", "#E84D8A","#F3B326"]
fig = go.Figure(data=[go.Scatter3d(
x=Kmeans_X['x'],
y=Kmeans_X['y'],
z=Kmeans_X['z'],
mode='markers',
marker=dict(
size=4,
color=Kmeans_X["Clusters"],
colorscale= palette,
opacity=0.8
)
)])
# Update layout
fig.update_layout(
scene=dict(
xaxis=dict(title='X'),
yaxis=dict(title='Y'),
zaxis=dict(title='Z')
),
title="Clustering using K-Means")
# Show the plot
fig.show()
cluster_counts = Kmeans_X["Clusters"].value_counts().reset_index()
cluster_counts.columns = ["Cluster", "Count"]
# Plot countplot using Plotly
fig = px.bar(cluster_counts, x="Cluster", y="Count", color="Cluster",
labels={"Cluster": "Cluster", "Count": "Count"},
title="Distribution Of The Clusters using KMeans",
color_discrete_sequence=palette)
fig.show()
8.2 | DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together closely packed points based on density. Unlike K-Means, DBSCAN does not require specifying the number of clusters beforehand and can identify clusters of arbitrary shapes and sizes.
The algorithm defines clusters as dense regions of points separated by regions of lower density. Points in dense regions are considered core points, while points in lower-density regions that are close to core points are considered border points. Points that are neither core nor border points are classified as noise.
DBSCAN is particularly useful for datasets with complex structures and varying densities. It automatically discovers clusters and is robust to outliers. Additionally, DBSCAN does not assume clusters to be globular or convex-shaped, making it suitable for a wide range of applications in data mining, pattern recognition, and spatial data analysis.
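A practical note on the `eps=0.5` used below: a common heuristic is the k-distance plot, where eps is read off the "elbow" of the sorted distances from each point to its k-th nearest neighbour (with k equal to `min_samples`). A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))  # synthetic stand-in for the PCA-reduced data

k = 5  # matches min_samples=5 used below
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the k-th neighbour; eps is chosen near the elbow of this curve
k_dist = np.sort(distances[:, -1])
print(f"90th-percentile k-distance: {k_dist[int(0.9 * len(k_dist))]:.2f}")
```

Plotting `k_dist` against the point index (e.g. with Plotly, as elsewhere in this notebook) makes the elbow visible; points beyond it are candidates for noise.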
dbscan = DBSCAN(eps=0.5, min_samples=5)
DBscan_X = X_reduced.copy()
yhat_dbscan = dbscan.fit_predict(X_reduced)
DBscan_X["Clusters"] = yhat_dbscan
fig = go.Figure(data=[go.Scatter3d(
x=DBscan_X['x'],
y=DBscan_X['y'],
z=DBscan_X['z'],
mode='markers',
marker=dict(
size=4,
color=DBscan_X["Clusters"],
colorscale=palette,
opacity=0.8
)
)])
# Update layout
fig.update_layout(
scene=dict(
xaxis=dict(title='X'),
yaxis=dict(title='Y'),
zaxis=dict(title='Z')
),
title="Clustering using DBSCAN")
# Show the plot
fig.show()
cluster_counts = DBscan_X["Clusters"].value_counts().reset_index()
cluster_counts.columns = ["Cluster", "Count"]
fig = px.bar(cluster_counts, x="Cluster", y="Count", color="Cluster",
labels={"Cluster": "Cluster", "Count": "Count"},
title="Distribution Of The Clusters using DBSCAN",
color_discrete_sequence=palette)
fig.show()
8.3 | Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It does not require specifying the number of clusters beforehand and can be visualized using a dendrogram. There are two main types of hierarchical clustering: agglomerative and divisive.
Agglomerative hierarchical clustering starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until only one cluster remains. Divisive hierarchical clustering, on the other hand, starts with all data points in one cluster and recursively splits clusters into smaller clusters until each data point is in its own cluster.
Hierarchical clustering is useful for exploring the structure of the data and identifying nested clusters at different levels of granularity. It provides insights into the relationships between data points and can be visualized to aid interpretation. Hierarchical clustering is commonly used in biological taxonomy, social network analysis, and customer segmentation.
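The dendrogram mentioned above is not drawn in this notebook; a minimal SciPy sketch on synthetic stand-in data shows how the merge tree is built and then cut into flat clusters, mirroring the AgglomerativeClustering call below:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))  # synthetic stand-in for X_reduced

# Build the agglomerative merge tree with Ward linkage
Z = linkage(X_demo, method="ward")

# Cut the tree into at most 4 flat clusters, matching n_clusters=4 below
labels = fcluster(Z, t=4, criterion="maxclust")
print(sorted(set(labels)))

# dendrogram(Z)  # uncomment inside a notebook to draw the tree itself
```

Ward linkage is also AgglomerativeClustering's default, so the flat labels from this cut correspond to the same kind of partition as `AC.fit_predict` produces.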
AC = AgglomerativeClustering(n_clusters=4)
AC_X = X_reduced.copy()
yhat_AC = AC.fit_predict(X_reduced)
AC_X["Clusters"] = yhat_AC
fig = go.Figure(data=[go.Scatter3d(
x=AC_X['x'],
y=AC_X['y'],
z=AC_X['z'],
mode='markers',
marker=dict(
size=4,
color=AC_X["Clusters"],
colorscale=palette,
opacity=0.8
)
)])
# Update layout
fig.update_layout(
scene=dict(
xaxis=dict(title='X'),
yaxis=dict(title='Y'),
zaxis=dict(title='Z')
),
title="Clustering using Agglomerative Clustering")
# Show the plot
fig.show()
cluster_counts = AC_X["Clusters"].value_counts().reset_index()
cluster_counts.columns = ["Cluster", "Count"]
# Plot countplot using Plotly
fig = px.bar(cluster_counts, x="Cluster", y="Count", color="Cluster",
labels={"Cluster": "Clusters", "Count": "Count"},
title="Distribution Of The Clusters using Agglomerative Clustering",
color_discrete_sequence=palette)
fig.show()
💡 Clustering Analysis
We have now segmented customers into different types based on their features and analyzed their behavior patterns.
Here are different segments of customers we have observed:
- Cluster 1: Independent Decision Makers with Varied Shopping Habits
They make decisions independently, consult with family selectively, and shop across a variety of outlets. Their purchases are based on personal needs, and they use a mix of payment methods. Quantity and frequency of purchases vary.
- Cluster 2: Consultative Decision Makers with Family Influence
They involve family in decisions, prefer branded products, and shop more frequently. They prefer structured shopping from branded retailers and company showrooms, often using credit cards or cash for payments.
- Cluster 3: Bargaining Decision Makers with Opportunistic Shopping Habits
Actively involved in bargaining, they make independent decisions, seizing bargains and discounts. They buy a mix of branded and non-branded products, with opportunistic shopping behaviors across various locations. Cash is their preferred payment method.
- Cluster 4: Routine Decision Makers with Fixed Shopping Patterns
They decide independently, following fixed patterns, and prefer specific outlets for branded products. Their purchases are timed, often around specific occasions, and they prefer shops offering discounts or fixed rates. Debit cards or cash are their preferred payment methods.
9.1 | Apriori Algorithm
The Apriori algorithm is a classic algorithm used for association rule mining in transactional datasets. It aims to discover frequent itemsets, which are sets of items that frequently occur together in transactions. These itemsets are then used to generate association rules that describe the relationships between items.
The Apriori algorithm employs a bottom-up approach, where it starts by identifying individual items that meet a minimum support threshold. It then iteratively generates larger itemsets by combining smaller itemsets that also meet the minimum support threshold.
The Apriori algorithm is widely used in market basket analysis, recommendation systems, and customer behavior analysis. It helps businesses uncover patterns and trends in transactional data, leading to insights that can be used for targeted marketing, cross-selling, and product placement strategies.
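Before running Apriori on the survey responses, the two core quantities, support and confidence, can be computed by hand on a toy basket (the items here are hypothetical):

```python
# Five toy transactions; each basket is the set of items bought together
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

# {milk, bread} appears in 3 of 5 baskets -> support 3/5
s_milk_bread = support({"milk", "bread"})
# confidence(milk -> bread) = support(milk & bread) / support(milk) = 0.6 / 0.8
conf_milk_to_bread = s_milk_bread / support({"milk"})

print(s_milk_bread, conf_milk_to_bread)
```

Apriori's pruning rests on the fact that support can only shrink as an itemset grows, so any superset of an infrequent itemset can be skipped without counting it.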
apriori_df = pd.read_excel("survey.xlsx")
binary_columns = []
for column in apriori_df.columns:
if apriori_df[column].nunique() == 2 or set(apriori_df[column].unique()) == {"Yes", "No"}:
binary_columns.append(column)
association_dataset = []
for index, row in apriori_df.iterrows():
association_row = []
for column in binary_columns:
if row[column] == "Yes":
association_row.append(column)
association_dataset.append(association_row)
# Convert elements in the list to strings
association_dataset = [[str(item) for item in row] for row in association_dataset]
# Initialize and fit the transaction encoder
encoder = TransactionEncoder()
encoder.fit(association_dataset)
# Transform the transactions into a one-hot encoded DataFrame
onehot = encoder.transform(association_dataset)
df_transformed = pd.DataFrame(onehot, columns=encoder.columns_)
# Find frequent itemsets with minimum support
frequent_itemsets = apriori(df_transformed, min_support=0.9, use_colnames=True)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)
styled_df = rules.head(20).reset_index(drop=True).style.background_gradient(cmap='Blues').set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', '')]}])
styled_df
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | frozenset({'Q23'}) | frozenset({'Q3'}) | 0.982578 | 0.996516 | 0.982578 | 1.000000 | 1.003497 | 0.003424 | inf | 0.200000 |
| 1 | frozenset({'Q3'}) | frozenset({'Q23'}) | 0.996516 | 0.982578 | 0.982578 | 0.986014 | 1.003497 | 0.003424 | 1.245645 | 1.000000 |
| 2 | frozenset({'Q23'}) | frozenset({'Q31'}) | 0.982578 | 0.986063 | 0.975610 | 0.992908 | 1.006942 | 0.006726 | 1.965157 | 0.395714 |
| 3 | frozenset({'Q31'}) | frozenset({'Q23'}) | 0.986063 | 0.982578 | 0.975610 | 0.989399 | 1.006942 | 0.006726 | 1.643438 | 0.494643 |
| 4 | frozenset({'Q23'}) | frozenset({'Q32'}) | 0.982578 | 0.919861 | 0.912892 | 0.929078 | 1.010020 | 0.009057 | 1.129965 | 0.569466 |
| 5 | frozenset({'Q32'}) | frozenset({'Q23'}) | 0.919861 | 0.982578 | 0.912892 | 0.992424 | 1.010020 | 0.009057 | 2.299652 | 0.123797 |
| 6 | frozenset({'Q31'}) | frozenset({'Q3'}) | 0.986063 | 0.996516 | 0.982578 | 0.996466 | 0.999951 | -0.000049 | 0.986063 | -0.003534 |
| 7 | frozenset({'Q3'}) | frozenset({'Q31'}) | 0.996516 | 0.986063 | 0.982578 | 0.986014 | 0.999951 | -0.000049 | 0.996516 | -0.013986 |
| 8 | frozenset({'Q32'}) | frozenset({'Q3'}) | 0.919861 | 0.996516 | 0.919861 | 1.000000 | 1.003497 | 0.003205 | inf | 0.043478 |
| 9 | frozenset({'Q3'}) | frozenset({'Q32'}) | 0.996516 | 0.919861 | 0.919861 | 0.923077 | 1.003497 | 0.003205 | 1.041812 | 1.000000 |
| 10 | frozenset({'Q31'}) | frozenset({'Q32'}) | 0.986063 | 0.919861 | 0.912892 | 0.925795 | 1.006451 | 0.005852 | 1.079973 | 0.459924 |
| 11 | frozenset({'Q32'}) | frozenset({'Q31'}) | 0.919861 | 0.986063 | 0.912892 | 0.992424 | 1.006451 | 0.005852 | 1.839721 | 0.079987 |
| 12 | frozenset({'Q23', 'Q31'}) | frozenset({'Q3'}) | 0.975610 | 0.996516 | 0.975610 | 1.000000 | 1.003497 | 0.003399 | inf | 0.142857 |
| 13 | frozenset({'Q23', 'Q3'}) | frozenset({'Q31'}) | 0.982578 | 0.986063 | 0.975610 | 0.992908 | 1.006942 | 0.006726 | 1.965157 | 0.395714 |
| 14 | frozenset({'Q31', 'Q3'}) | frozenset({'Q23'}) | 0.982578 | 0.982578 | 0.975610 | 0.992908 | 1.010513 | 0.010149 | 2.456446 | 0.597143 |
| 15 | frozenset({'Q23'}) | frozenset({'Q31', 'Q3'}) | 0.982578 | 0.982578 | 0.975610 | 0.992908 | 1.010513 | 0.010149 | 2.456446 | 0.597143 |
| 16 | frozenset({'Q31'}) | frozenset({'Q23', 'Q3'}) | 0.986063 | 0.982578 | 0.975610 | 0.989399 | 1.006942 | 0.006726 | 1.643438 | 0.494643 |
| 17 | frozenset({'Q3'}) | frozenset({'Q23', 'Q31'}) | 0.996516 | 0.975610 | 0.975610 | 0.979021 | 1.003497 | 0.003399 | 1.162602 | 1.000000 |
| 18 | frozenset({'Q23', 'Q32'}) | frozenset({'Q3'}) | 0.912892 | 0.996516 | 0.912892 | 1.000000 | 1.003497 | 0.003181 | inf | 0.040000 |
| 19 | frozenset({'Q23', 'Q3'}) | frozenset({'Q32'}) | 0.982578 | 0.919861 | 0.912892 | 0.929078 | 1.010020 | 0.009057 | 1.129965 | 0.569466 |
unique_questions = set()
for antecedent, consequent in zip(rules['antecedents'], rules['consequents']):
unique_questions.update(antecedent)
unique_questions.update(consequent)
G = nx.DiGraph()
for question in unique_questions:
G.add_node(question)
for idx, rule in rules.iterrows():
antecedent, consequent = rule['antecedents'], rule['consequents']
for a in antecedent:
for c in consequent:
G.add_edge(a, c)
pos = nx.circular_layout(G)
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, pos, with_labels=True, node_size=2000, node_color='skyblue', font_size=12, font_weight='bold', edge_color='grey', width=2, alpha=0.7)
plt.title("Questions and Association Rules")
plt.show()
10. | Conclusions and Future Improvements 🧐
- EDA (Exploratory Data Analysis): The exploratory data analysis revealed important insights into the structure and distribution of the data. It helped in understanding the characteristics of different variables and their relationships.
- PCA (Principal Component Analysis): PCA was employed for dimensionality reduction, which enabled visualizing high-dimensional data in lower dimensions. It helped in identifying the most significant features contributing to the variance in the dataset.
- Clustering: Clustering algorithms, such as K-Means, DBSCAN, and Agglomerative Clustering, were applied to group similar data points together. This facilitated the identification of patterns and subgroups within the data.
- Anomaly Detection: Anomaly detection techniques were utilized to identify unusual or outlying observations in the dataset. This aided in detecting potential errors or anomalies that deviate from normal behavior.
- Apriori Algorithm: The Apriori algorithm was used for association rule mining to discover interesting relationships between variables. It helped in uncovering more than 50 frequent patterns and association rules, which could be valuable for decision-making and marketing.
Segmented Customers:
- Independent Shoppers: Make decisions independently, have varied shopping habits, and buy as needed.
- Family Consulters: Consult with family for decisions, prefer branded products, and shop in a structured way.
- Bargain Hunters: Actively bargain, shop opportunistically, and buy varied products online and offline.
- Routine Shoppers: Make decisions themselves, follow fixed shopping patterns, and prefer specific shops and timings.
Thank You